Support RL online quantization with torchao #23014
Conversation
Code Review
This pull request introduces a mechanism to initialize TorchAOConfig from a file, which is a great step towards enabling on-the-fly quantization. The changes span configuration, the torchao quantization layer, and weight loading utilities. While the overall direction is good, I've identified a few critical and high-severity issues. There's a significant logic bug in weight_utils.py that seems to prevent the feature from working on models that are not already quantized. Another critical issue is that dummy weight initialization for profiling has been commented out, which will likely break profiling runs. Additionally, I've pointed out a couple of high-severity issues in the new torchao code related to a potential TypeError from an unsafe method signature and a hardcoded dtype marked as a "temp hack". I've provided specific suggestions to address each of these points.
Summary: Only supporting quantizing all linear layers with a torchao config for now. See the vLLM PR for how to generate the quantization file. Also requires vLLM changes: vllm-project/vllm#23014. Test Plan: sh examples/ppo_trainer/run_deepseek7b_llm.sh Reviewers: Subscribers: Tasks: Tags:
Waiting on verl to confirm the API changes make sense first, before cleaning up this PR for review.
Can't repro the quantization test timeout locally; rebasing and running the tests again to see if it persists.
OK, the quantization tests passed. The Language Model Tests are failing, but I don't think they are related to these changes; I also saw the Language Model Tests failing on main: https://buildkite.com/vllm/ci/builds/33089/steps/canvas. I think it's safe to merge now.
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
**Summary:** Existing support for `load_in_fp8=True` performs an offline quantization when loading the initial model. This is no longer necessary as of vllm==0.12.0 (after vllm-project/vllm#23014), where we can quantize the model on the fly when we load it:

```python
llm = LLM(
    ...
    hf_overrides={
        "quantization_config_dict_str": json.dumps(torchao_config),
    },
)
```

**Test Plan:** https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
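To make the snippet above concrete, here is a minimal sketch of assembling the override before constructing the `LLM`. The contents of `torchao_config` are an illustrative assumption (the real dict comes from serializing a torchao config object, so these field names may differ); the vLLM call itself is left as a comment to keep the sketch self-contained:

```python
import json

# Illustrative torchao config dict; the actual schema is produced by
# torchao's config serialization, so these field names are assumptions.
torchao_config = {
    "_type": "Float8DynamicActivationFloat8WeightConfig",
    "_version": 1,
    "_data": {},
}

# vLLM reads the quantization config from this HF override as a JSON string.
hf_overrides = {
    "quantization_config_dict_str": json.dumps(torchao_config),
}

# Then pass it to vLLM (requires vllm>=0.12.0):
# llm = LLM(model=..., hf_overrides=hf_overrides)
```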
Summary:
This is to enable online quant for verl. The PR adds support for initializing a `TorchAOConfig` object in vLLM through a serialized JSON file that specifies the type of quantization people want, or through a JSON-serialized `TorchAOConfig` object.
Code for serializing the config to JSON:
Code for serializing the config to a file:
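As a rough sketch of both steps (the `quant_config` dict is an illustrative assumption — the real schema is whatever torchao's config serialization emits, not these exact field names):

```python
import json
import tempfile

# Illustrative serialized quantization config; the actual schema is
# defined by torchao's config serialization, these fields are assumptions.
quant_config = {
    "_type": "Float8DynamicActivationFloat8WeightConfig",
    "_version": 1,
    "_data": {},
}

# Serialize to a JSON string (usable for quantization_config_dict_str).
config_str = json.dumps(quant_config)

# Or write it out as a file that vLLM can be pointed at.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(config_str)
    config_path = f.name
```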
This also supports module-level config through the `ModuleFqnToConfig` config (https://huggingface.co/docs/transformers/main/en/quantization/torchao#per-module-quantization), although this is not tested yet.
More configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
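To sketch the per-module idea: a module-FQN-to-config mapping pairs fully qualified module names with per-module configs, with a default entry as a fallback. The JSON below is an illustrative assumption of that shape, not torchao's exact serialized format:

```python
import json

# Illustrative module-FQN -> config mapping in the spirit of torchao's
# ModuleFqnToConfig; the exact serialized schema is an assumption here.
module_fqn_to_config = {
    # "_default" applies to modules without a more specific entry.
    "_default": {"_type": "Int8WeightOnlyConfig", "_version": 1, "_data": {}},
    # A specific layer can get its own config.
    "model.layers.0.self_attn.q_proj": {
        "_type": "Float8DynamicActivationFloat8WeightConfig",
        "_version": 1,
        "_data": {},
    },
}

serialized = json.dumps({"module_fqn_to_config": module_fqn_to_config})
loaded = json.loads(serialized)
```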
Note: this has incorporated changes from @LiyuanLucasLiu's PR #23901. The vLLM fp8 quant method is not supported yet; we can add that in a separate PR.
Test Plan:
pytest tests/quantization/test_torchao.py -k test_on_the_fly_quant
pytest tests/quantization/test_torchao.py -k test_reload_weights
and regression tests:
pytest tests/quantization/test_torchao.py
Reviewers:
Subscribers:
Tasks:
Tags: